CMCS: contrastive-metric learning via vector-level sampling and augmentation for code search

Code search aims to retrieve code snippets from a large codebase that are semantically related to natural-language queries. Deep learning is an effective approach to code search, and the quality of the training data directly impacts the performance of deep-learning models. However, most existing deep-learning models for code search have overlooked the critical role of the training data within each batch, particularly hard negative samples, in optimizing model parameters. In this paper, we propose CMCS, a contrastive-metric learning approach for code search based on vector-level sampling and augmentation. Specifically, we propose a sampling method that obtains hard negative samples using the K-means algorithm, and a hardness-controllable augmentation method that obtains positive and hard negative samples using vector-level augmentation techniques. We then design an optimization objective composed of metric learning and multimodal contrastive learning using the obtained positive and hard negative samples. Extensive experiments were conducted on the large-scale dataset CodeSearchNet using seven advanced code search models. The results show that the proposed method significantly enhances the training efficiency and search performance of code search models, which is conducive to promoting software engineering development.

Recently, some studies have introduced contrastive learning into code search [19][20][21]. Contrastive learning utilizes augmentation techniques to generate code snippets similar to positive samples and trains the model to distinguish the positive samples from the negative samples. In this way, contrastive learning improves the model's ability to capture the critical features of the samples. However, current research on contrastive learning for code search does not fully utilize the multimodal features of code snippets. Most studies focus solely on the semantic features of tokenized code sequences and the syntactic features of the code's Abstract Syntax Tree (AST) while overlooking aspects such as control flow and data flow features [22][23][24]. Augmentation techniques are essential to the unsupervised training of contrastive learning, and existing augmentation methods for code are generally divided into text-level and vector-level methods. Text-level augmentation generates positive samples by rewriting the original code sample [25,26], for example by renaming variables, inserting meaningless statements, or reordering independent statements.
Vector-level augmentation generates positive samples by perturbing the representation vector of the anchor code sample, using methods such as linear interpolation and stochastic perturbation [27]. However, most existing contrastive learning methods still focus on text-level augmentation. Contrastive learning needs augmented positive sample vectors and trains the model on them together with negative sample vectors. Positive samples generated through text-level augmentation must first be encoded into vectors by the deep learning model. In contrast, vector-level augmentation directly generates positive sample vectors from the anchor sample vectors without an additional encoding pass. Therefore, text-level augmentation-based contrastive learning consumes more time and computational resources. Additionally, the negative samples in the mini-batch for contrastive learning are randomly sampled, as in the metric learning mentioned above, so the beneficial influence of hard negative samples is again not considered. Nevertheless, the augmentation techniques used for generating positive samples have inspired our idea of generating vector-level hard negative samples, which helps to overcome the limitation of relying solely on sampling to obtain hard negative samples.
To combine the benefits of metric and contrastive learning, and to fully leverage the hard negative samples and multimodal features of code, we propose an effective contrastive-metric learning approach for code search (CMCS) based on vector-level sampling and augmentation. CMCS can improve the training efficiency and effectiveness of code search models, which is achieved mainly through the following two key components.
(1) Vector-level sampling and augmentation. Firstly, we fine-tune a pre-trained model based on metric learning and represent the code snippets in the training set as vectors by integrating multimodal features. Secondly, we use the K-means algorithm to cluster these code vectors. Finally, based on the clustering results, we select code snippets in the same cluster as the anchor code sample as hard negative samples according to vector similarity. For sample augmentation, we propose a vector-level augmentation method with controllable hardness, which is used to generate hard negative samples and positive samples.
(2) Contrastive-metric learning. Based on the hard negative and positive samples obtained by sampling and augmentation, we propose contrastive-metric learning for code search, combining metric learning and multimodal contrastive learning. Since code and query belong to two different languages, we treat them as two modalities. To fully utilize the various features of the code, we further parse the code into tokens, an AST, a Control Flow Graph (CFG), and a Data Flow Graph (DFG). The parsing results are serialized and represented as the semantic, syntactic, control flow, and data flow features of the code, respectively. These features are regarded as different modal features of the code. Finally, we use contrastive learning to train the model on each modality feature using hard negative and positive samples. This approach helps the model better understand and represent code and query. Meanwhile, we also train the model using metric learning with the obtained hard negative samples, which helps the model better learn the association between relevant code-query pairs.
The contributions of this work can be summarized as follows:
• We propose a vector-level sampling method based on K-means for hard negative samples, and a hardness-controllable vector-level augmentation method for positive and hard negative samples. For the augmentation method, we propose a fine-grained random augmentation strategy to increase the diversity of feature patterns of the obtained samples and reduce the impact of overfitting.
• Based on the sampling and augmentation method, we propose effective contrastive-metric learning for code search (CMCS). The contrastive learning part of this approach fully utilizes various features of the code and considers multiple modalities of input data, enabling the model to deeply learn the features of both the code and queries. The metric learning part enhances the model's ability to match relevant codes and queries.
• We train CMCS on six programming languages separately with seven state-of-the-art deep code models, and the results demonstrate that CMCS can effectively improve the training efficiency and search performance of the models.
The rest of the paper is organized as follows. Section "Methodology" introduces the details of CMCS, including the sampling and augmentation methods and contrastive-metric learning. Section "Experimental evaluation" presents the experimental evaluation of CMCS's performance. Section "Related work" introduces related work, including code search and data augmentation. In section "Conclusions", we conclude the paper and propose our future work.

Methodology
The overall architecture of CMCS is shown in Fig. 3a, which consists of the following four main components.
(1) The sampling method for hard negative samples uses the K-means algorithm to cluster the pre-represented code vectors and performs sampling based on the clustering results. The code vectors used for clustering are obtained by fusing the multimodal features of the code represented by the encoder and fusion module.
(2) The encoder separately represents the multimodal features of the inputs, including semantic features of the query, code semantic features, code syntactic features, code control flow features, and code data flow features.
(3) The hardness-controllable sample augmentation method based on vector-level augmentation generates positive and hard negative samples for contrastive-metric learning.
(4) Contrastive-metric learning (CML) consists of metric learning (ML) and multiple contrastive learning (CL) objectives for different modal features, which jointly optimize the model and improve code search performance.
The details of the sampling method, multimodal representation, augmentation method, and contrastive-metric learning are elaborated as follows.

The sampling method for hard negative samples
In this section, we introduce how to sample hard negative samples from the training set, which allows us to construct a mini-batch rich in hard negative samples. The sampling method is based on the vectors of code snippets.
In the sampling before each epoch, we use the encoder and fusion module updated after the last epoch to obtain vectors that fuse the multimodal code features. The specific multimodal representation and fusion details are given in sections "The multimodal representations of the input" and "Metric learning for CMCS". Before the first epoch of training, we use the unoptimized encoder and fusion module to represent the multimodal code features based on metric learning, where the loss function used for metric-learning fine-tuning is shown in Eq. (1). In Eq. (1), $V_{Q_i}$ represents an anchor query sample vector in a mini-batch of size $B$, which has a corresponding semantically related code snippet vector $V_{C_i}$ as its positive sample. All other code snippet vectors $V_{C_j}$ ($i \neq j$) in the batch are negative samples.
Specifically, we obtain the vectors of all code snippets in the training set through steps ①④⑤⑥ shown in Fig. 3a. Subsequently, we use the K-means algorithm to cluster the code snippets into $K$ clusters based on these vectors. Code snippets within the same cluster are relatively close in vector distance, because the algorithm groups samples by vector distance. Since the functions of code snippets in the dataset are mutually exclusive, code snippets within the same cluster have different semantics but similar vectors, i.e., they are hard negative samples of each other. When constructing the mini-batch, we need to ensure that it contains a certain number of hard negative samples to improve the training effect. Therefore, we sample a number of code samples equal to half of the batch size from the same cluster, as they are hard negative samples for each other. To ensure the randomness and diversity of samples in the mini-batch and to avoid overfitting, we randomly sample a number of negative samples from the codebase equal to half of the batch size, which together with the sampled hard negative samples form a complete mini-batch, as shown in step ② of Fig. 3a.
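The mini-batch construction described above can be sketched as follows. This is a minimal illustration that assumes cluster labels have already been produced by K-means on the fused code vectors; the function name `build_minibatch` and the choice to exclude the hard half from the random half (to avoid duplicates) are assumptions, not details given in the paper.

```python
import random
from collections import defaultdict

def build_minibatch(cluster_labels, batch_size, rng):
    """Return (hard, rest): indices of the same-cluster half and the random half."""
    clusters = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        clusters[label].append(idx)
    half = batch_size // 2
    # Only clusters with at least `half` members can supply the hard-negative half.
    eligible = [members for members in clusters.values() if len(members) >= half]
    if not eligible:
        raise ValueError("no cluster is large enough for half a batch")
    # Samples drawn from one cluster act as hard negatives for each other.
    hard = rng.sample(rng.choice(eligible), half)
    hard_set = set(hard)
    # The remaining half is drawn uniformly from the rest of the training set.
    pool = [i for i in range(len(cluster_labels)) if i not in hard_set]
    rest = rng.sample(pool, batch_size - half)
    return hard, rest
```

Calling `build_minibatch(labels, 32, random.Random(0))` would return 16 indices that share a cluster and 16 indices drawn at random; the seed and batch size here are only examples.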

The multimodal representations of the input
Deep learning-based code search explores the relationship between code and query in vector space, so the model's ability to capture and represent the features of code and query determines the performance of code search. Many current studies, especially those involving contrastive learning in code search, represent only the token sequence of a code snippet and treat the resulting vector as the overall feature of the snippet. However, code has more than just the semantic features represented by token sequences. Processed code can present multi-dimensional features, such as syntactic, data flow, and control flow features. These features describe code snippets from different dimensions. Since code and query are two types of languages, we regard them as two modalities. In addition to representing the query modality, CMCS refines the code modality, mining various modality features of the code.
The steps shown in Fig. 3a describe the multimodal representation process of the input. For the sake of description, we assume that the batch size is 2, i.e., we operate on $(C_1, Q_1)$ and $(C_2, Q_2)$. For the multimodal representation of code, we first parse the code into code text segments, ASTs, CFGs, and DFGs, as shown in step ④ of Fig. 3a. Then, we convert them into sequences. Specifically, the code segment is tokenized to obtain a token sequence, the AST is traversed in preorder to obtain a tree sequence, and the CFG and DFG are separately processed to extract edge information, resulting in control flow and data flow sequences. Finally, the obtained sequences are represented as vectors $V_t$, $V_a$, $V_c$, and $V_d$ through the encoder, representing the semantic, syntactic, control flow, and data flow features of the code snippet, respectively. For the query, we only parse it into a token sequence for representation. As a result, we obtain the query modality feature $V_Q$ and the four code modality features $V_t$, $V_a$, $V_c$, and $V_d$.
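The serialization step can be illustrated for two of the four code views. The sketch below uses Python's standard `tokenize` and `ast` modules as stand-ins for the parser; the paper does not state which parsing tools CMCS uses, and the function names are hypothetical. CFG and DFG edge extraction is omitted.

```python
import ast
import io
import tokenize

def token_sequence(source):
    """Lexical tokens of the snippet, skipping layout-only tokens and comments."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type not in skip]

def ast_preorder_sequence(source):
    """AST node-type names in preorder (each parent before its children)."""
    def visit(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)
    return list(visit(ast.parse(source)))

snippet = "def add(a, b):\n    return a + b\n"
tokens = token_sequence(snippet)        # input for the semantic feature V_t
tree = ast_preorder_sequence(snippet)   # input for the syntactic feature V_a
```

Each resulting sequence would then be passed to the encoder to obtain the corresponding modality vector.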

The hardness-controllable sample augmentation method
In this section, we first introduce the four vector-level augmentation methods used in CMCS. Then, we explain how to obtain the anchor samples' positive samples and hard negative samples by controlling the augmentation hardness. Finally, we discuss the strategies adopted in the CMCS augmentation process.

Vector-level augmentation method
The essence of the vector-level augmentation method is to generate new sample vectors from the anchor vector by perturbing its features. Existing vector-level augmentation methods mainly include linear interpolation, stochastic perturbation, binary interpolation, and Gaussian scaling.
(1) Linear interpolation [28]. Linear interpolation uses the features of another sample to augment the anchor sample, and is calculated as in Eq. (2).
In Eq. (2), $V_i$ is the anchor sample vector and $V_j$ is randomly sampled from the other samples. $\lambda$ is the interpolation coefficient sampled from a uniform distribution $U(\alpha, \beta)$, where $\alpha$ and $\beta$ are adjustable parameters near 1.0.
(2) Stochastic perturbation [29]. Stochastic perturbation randomly deactivates some features of the sample to obtain a new sample, as shown in Eq. (3).
In Eq. (3), $V_i^{f}(e)$ represents the $e$-th dimension feature of the sample vector $V_i$, and $\xi$ is sampled from a Bernoulli distribution $B(e, \rho)$ to control whether the feature is deactivated. $\rho$ is a small deactivation probability. In implementations, a Dropout layer is generally used for stochastic perturbation.
(3) Binary interpolation [30]. Binary interpolation randomly swaps the feature $V_i^{f}(e)$ of the anchor sample vector with the feature $V_j^{f}(e)$ of another sample to generate a new sample vector, as in Eq. (4).
In Eq. (4), $\xi \sim B(e, \rho)$ controls whether the feature is chosen for swapping, and $\rho$ is a small swapping probability.
(4) Gaussian scaling. Gaussian scaling augments the sample vector $V_i$ by scaling it by a small factor, which can be viewed as adding perturbation noise to the sample, as in Eq. (5).
In Eq. (5), $\mu$ is the scaling coefficient sampled from a Gaussian distribution $N(0, \sigma)$ with a small value of $\sigma$.
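The four methods can be sketched on plain Python lists as follows. The forms used here follow the verbal descriptions and the usual definitions of these techniques, and are assumptions rather than transcriptions of Eqs. (2)-(5). In particular, the per-dimension noise in Gaussian scaling, the multiplicative form `a * (1 + mu)`, and the absence of the rescaling that framework Dropout layers apply are illustrative choices.

```python
import random

def linear_interpolation(v_i, v_j, alpha, beta, rng):
    """Mix the anchor with another sample; lam is drawn from U(alpha, beta)."""
    lam = rng.uniform(alpha, beta)
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_i, v_j)]

def stochastic_perturbation(v_i, rho, rng):
    """Zero out each feature independently with probability rho."""
    return [0.0 if rng.random() < rho else a for a in v_i]

def binary_interpolation(v_i, v_j, rho, rng):
    """Replace each anchor feature by the other sample's feature with probability rho."""
    return [b if rng.random() < rho else a for a, b in zip(v_i, v_j)]

def gaussian_scaling(v_i, sigma, rng):
    """Scale each feature by (1 + mu), with mu drawn from N(0, sigma)."""
    return [a * (1.0 + rng.gauss(0.0, sigma)) for a in v_i]

rng = random.Random(0)
anchor = [0.2, -0.5, 0.9, 0.1]
other = [0.7, 0.3, -0.4, 0.0]
augmented = linear_interpolation(anchor, other, 0.9, 1.0, rng)
```

With coefficients that imply no perturbation (for example `rho = 0` or `sigma = 0`), each function returns the anchor unchanged, which matches the intuition that smaller perturbation yields samples closer to the anchor.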

Hardness-controllable augmentation
Regarding positive and negative samples and contrastive learning, we have two observations: (1) In the vector space, the essential difference between positive and negative samples of an anchor sample is the vector distance. The vector distance between a positive sample and an anchor sample is small, whereas the vector distance between a negative sample and an anchor sample is large. In fact, the difference between positive and negative samples, including hard negative samples, is mainly manifested in the vector distance from the anchor sample. (2) In contrastive learning, using data augmentation techniques to generate positive samples essentially means generating samples similar to the anchor samples. Vector-level augmentation techniques directly perturb the representation vectors to generate vectors with high similarity to the anchor vectors as positive samples.
Inspired by the above two observations, we use vector-level augmentation techniques to generate representation vectors with different similarities (hardness) to the anchor samples as positive and hard negative samples. To the best of our knowledge, we are the first to propose vector-level augmentation for hard negative samples.
Specifically, the four augmentation methods are all essentially based on perturbing the vector. Among them, $\lambda$ (controlled by $\alpha$ and $\beta$) in linear interpolation, $\rho$ in stochastic perturbation, $\xi$ (controlled by $\rho$) in binary interpolation, and $\mu$ (controlled by $\sigma$) in Gaussian scaling all control the degree of perturbation, which determines the similarity (hardness) between the augmented samples and the anchor samples. The four augmentation methods can be abstracted by the perturbation coefficients as an augmentation function, as shown in Eq. (6). In Eq. (6), $\xi \sim B(e, \rho)$ controls whether another sample is brought in, and $\theta \in (0, 1.0]$ represents the hardness. The larger the hardness, the more similar the generated samples are to the anchor samples, so they can be used as positive samples for contrastive learning. The smaller the hardness, the greater the difference between the generated samples and the anchor sample, so they are considered negative samples. Hard negative samples correspond to an intermediate range of hardness values, between those of positive samples and those of ordinary negative samples; the optimal hardness value will be discussed in the experimental section. Therefore, we augment positive and hard negative samples by controlling the hardness of the augmentation. CMCS samples the pairs $(C_1, Q_1), (C_2, Q_2), \ldots, (C_B, Q_B)$ to assemble a mini-batch of size $B$.
For a given sample pair $(C_i, Q_i)$, we first parse the code $C_i$ into a token sequence, an AST sequence, a CFG sequence, and a DFG sequence, and parse the query $Q_i$ into a token sequence. Then, the encoder represents these five sequences to obtain the corresponding five modal feature vectors $V_{t_i}$, $V_{a_i}$, $V_{c_i}$, $V_{d_i}$, and $V_{Q_i}$. For each modal feature vector, one of the four augmentation methods is randomly selected and used to augment the vector $M_1$ and $M_2$ times, with the perturbation amount randomly selected from a certain range under different hardness values, to obtain $M_1$ positive sample vectors $V_{t_i}^{+}$, $V_{a_i}^{+}$, $V_{c_i}^{+}$, $V_{d_i}^{+}$, and $V_{Q_i}^{+}$ and $M_2$ hard negative sample vectors $V_{Ht_i}$, $V_{Ha_i}$, $V_{Hc_i}$, $V_{Hd_i}$, and $V_{HQ_i}$. Intuitively, as shown in Fig. 3c, the dark blue circles and squares represent the original modal feature vectors. The light blue circles and squares represent the positive modal feature vectors obtained by augmenting the original vectors, and the purple circles and squares represent the hard negative modal feature vectors obtained by augmenting the original vectors.

Fine-grained random augmentation strategy
We have four augmentation methods, as shown in Eqs. (2)-(5). Each method generates positive and hard negative sample vectors from a given anchor sample vector by controlling the perturbation coefficient according to different hardness values. The perturbation coefficient determines the degree of perturbation applied to the anchor sample vector by the augmentation method. The hardness mentioned in this paper abstracts the perturbation coefficient across all augmentation methods, represented as $\theta$ in Eq. (6).
The fine-grained random augmentation strategy refers to randomly applying different augmentation methods and randomly determining varying perturbation amounts under a given perturbation coefficient for each sample vector. This approach ensures the richness and diversity of the generated samples, reducing the likelihood of overfitting and enhancing the model's training performance. The fine granularity of this strategy lies in the randomness it introduces in the perturbation magnitude under the perturbation coefficients assigned to the different augmentation methods. In CMCS, we need to generate multiple positive sample vectors and hard negative sample vectors for the query semantic feature vector $V_{Q_i}$ and the multimodal feature vectors of the code, including the code semantic feature vector $V_{t_i}$, the code syntactic feature vector $V_{a_i}$, the code control flow feature vector $V_{c_i}$, and the code data flow feature vector $V_{d_i}$. The strategy randomly selects one of the four methods for each augmentation of each given sample vector. It is important to note that the given perturbation coefficients only set the range for the degree of perturbation of the corresponding augmentation method; the specific perturbation amount within this range is variable. Therefore, the augmentation strategy proposed in this section randomly selects not only the augmentation method but also the perturbation amount within the range determined by the corresponding perturbation coefficients. This fine-grained random perturbation ensures the diversity of the feature patterns of the generated samples.
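The strategy can be sketched as a single function that draws both the method and the perturbation amount at random. Everything specific in this sketch is an assumption made for illustration: the mapping `strength = 1 - theta` from hardness to each method's perturbation coefficient, the hardness ranges `(0.9, 1.0)` and `(0.5, 0.7)`, and the names `augment` and `make_samples`. The paper determines the suitable hardness values experimentally.

```python
import random

def augment(v_i, v_j, hardness_range, rng):
    """Apply one randomly chosen method with a randomly drawn hardness theta."""
    low, high = hardness_range
    theta = rng.uniform(low, high)   # random amount within the allowed range
    strength = 1.0 - theta           # assumed mapping: lower hardness, stronger perturbation
    method = rng.choice(["linear", "dropout", "swap", "gauss"])
    if method == "linear":
        return [theta * a + strength * b for a, b in zip(v_i, v_j)]
    if method == "dropout":
        return [0.0 if rng.random() < strength else a for a in v_i]
    if method == "swap":
        return [b if rng.random() < strength else a for a, b in zip(v_i, v_j)]
    return [a * (1.0 + rng.gauss(0.0, strength)) for a in v_i]

def make_samples(v_i, v_j, m1, m2, pos_range, hard_range, rng):
    """Generate m1 positives (high hardness) and m2 hard negatives (intermediate hardness)."""
    positives = [augment(v_i, v_j, pos_range, rng) for _ in range(m1)]
    hard_negatives = [augment(v_i, v_j, hard_range, rng) for _ in range(m2)]
    return positives, hard_negatives

rng = random.Random(1)
anchor, other = [0.2, -0.5, 0.9, 0.1], [0.7, 0.3, -0.4, 0.0]
positives, hard_negatives = make_samples(anchor, other, 3, 2, (0.9, 1.0), (0.5, 0.7), rng)
```

Because the method and the amount are redrawn on every call, repeated augmentation of the same anchor produces different feature patterns, which is the diversity the strategy aims for.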

Contrastive-metric learning
Based on the samples obtained through sampling and augmentation, this section will separately introduce metric learning, multimodal contrastive learning, and the overall process of contrastive-metric learning.

Multimodal contrastive learning for CMCS
Contrastive learning learns the features of the samples in an unsupervised manner by distinguishing the positive samples generated by augmentation techniques from the negative samples. Code and query belong to programming language and natural language, respectively, which have significant semantic and syntactic differences and can be considered two modalities. CMCS also parses code snippets into token sequences, ASTs, CFGs, and DFGs. Therefore, the modality of the code is further subdivided into semantic, syntactic, control flow, and data flow feature modalities. In this section, we use the positive and hard negative samples obtained through sampling and augmentation to perform contrastive learning separately on the query modality and the four code modalities, thereby improving the model's ability to extract features from code and query.
After the samples in the mini-batch are parsed, serialized, and represented, we obtain five types of modality feature vectors: the code modality feature vectors $\{V_{t_i}, V_{a_i}, V_{c_i}, V_{d_i}\}_{i=1}^{B}$ and the query modality feature vectors $\{V_{Q_i}\}_{i=1}^{B}$. CMCS performs contrastive learning for each modality feature individually, allowing the model to learn the features of each modality. Specifically, let $V_i$ represent one type of modality feature vector of the $i$th sample pair $(C_i, Q_i)$ in the mini-batch; we perform augmentations to generate the corresponding positive sample set $\{\{V^{m}_{i+}\}_{m=1}^{M_1}\}_{i=1}^{B}$ and hard negative sample set $\{\{V^{m}_{H_i}\}_{m=1}^{M_2}\}_{i=1}^{B}$. As described in section "The sampling method for hard negative samples", each mini-batch is composed of sampled hard negative samples and randomly sampled general negative samples. Therefore, in the mini-batch, the other samples of the same modality feature $V_k$ ($k \neq i$) are negative samples for $V_i$ and also include a certain proportion of hard negative samples for $V_i$. In this way, in the contrastive learning training of each modality feature, the negative samples contain a certain proportion of hard negative samples, which benefits contrastive learning. The contrastive learning loss function for a given modality feature is shown in Eq. (7). Accordingly, we obtain the contrastive learning loss $L^{C}_{Q_i}$ for the query semantic modality features, $L^{C}_{t_i}$ for the code semantic modality features, $L^{C}_{a_i}$ for the code syntactic modality features, $L^{C}_{c_i}$ for the code control flow modality features, and $L^{C}_{d_i}$ for the code data flow modality features.
Intuitively, as shown in the CL part of Fig. 3b, the blue circles on the left represent a modality feature vector $V_1$, the light blue circles on the right represent the $M_1$ generated positive modality feature vectors $\{V^{m}_{1+}\}_{m=1}^{M_1}$, the purple circles represent the $M_2$ generated hard negative modality feature vectors $\{V^{m}_{H_1}\}_{m=1}^{M_2}$, and the dark blue circles on the right represent the negative modality feature vectors $V_2$ from the mini-batch. Each modality feature vector has $M_1$ semantically related positive modality feature vectors, connected by solid lines. The other vectors are all negative samples, connected by dashed lines.
In summary, we can calculate the contrastive learning loss values for the multimodal features of code, as well as for the query modal features. This multimodal contrastive learning approach improves the model's ability to understand and represent the features of code and query.
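For a single modality, a loss of this kind is commonly written in InfoNCE form. The sketch below is one such formulation under stated assumptions and is not a transcription of Eq. (7): it uses dot-product similarity, no temperature, and averages over the $M_1$ positives. The `negatives` argument is meant to hold both the augmented hard negatives and the other in-batch vectors of the same modality; the function names are hypothetical.

```python
import math

def dot(u, v):
    """Dot-product similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(anchor, positives, negatives):
    """InfoNCE-style loss for one anchor, averaged over its positives."""
    neg_scores = [dot(anchor, n) for n in negatives]
    total = 0.0
    for p in positives:
        pos_score = dot(anchor, p)
        scores = [pos_score] + neg_scores
        peak = max(scores)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_denominator - pos_score  # equals -log softmax(pos_score)
    return total / len(positives)

anchor = [1.0, 0.0]
easy = contrastive_loss(anchor, [[1.0, 0.0]], [[0.0, 1.0]])
hard = contrastive_loss(anchor, [[1.0, 0.0]], [[0.9, 0.0]])
```

In this toy case `hard` is larger than `easy`: a negative that lies close to the anchor contributes more to the loss, which is why hard negatives give a stronger training signal.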

Metric learning for CMCS
As shown in Fig. 3d, we next use a fusion module composed of a fully connected neural network to fuse the four modal feature vectors $V_{t_i}$, $V_{a_i}$, $V_{c_i}$, and $V_{d_i}$, as well as the augmented hard negative modal feature vectors $V^{m}_{Ht_i}$, $V^{m}_{Ha_i}$, $V^{m}_{Hc_i}$, and $V^{m}_{Hd_i}$ of the same code sample $C_i$. Eventually, we obtain the complete original code vector $V_{C_i}$ and the augmented hard negative code vector $V^{m}_{HC_i}$, both of which integrate all modal features. If Fusion denotes the fusion operation of the neural network, then these operations can be represented as Eqs. (8) and (9).
For a sample pair $(C_i, Q_i)$ in the mini-batch, we obtain its representation vectors $(V_{C_i}, V_{Q_i})$ and, by augmentation, the hard negative sample vectors $\{V^{m}_{HC_i}, V^{m}_{HQ_i}\}_{m=1}^{M_2}$. Metric learning aims to reduce the vector distance between related code-query pairs and increase the distance between unrelated code-query pairs. For a query, its unrelated code samples include not only the unrelated code samples in the mini-batch, but also the corresponding hard negative code samples generated by the augmentation method. Specifically, for the $i$th query vector $V_{Q_i}$, its semantically related code vector is $V_{C_i}$, and the semantically unrelated samples are $\{V^{m}_{HC_i}\}_{m=1}^{M_2}$ and $V_{C_j}$ ($i \neq j$). We use the dot product to calculate the vector similarity. The metric learning loss value of the query $V_{Q_i}$ can be calculated as in Eq. (10).
Intuitively, as shown in the ML part of Fig. 3b, the dark blue square on the left represents a certain query sample vector. The dark blue circles in the first row on the right represent the related code sample vectors of the query, and the purple circles in the second row represent the $M_2$ corresponding augmented hard negative code sample vectors. The remaining dark blue circles represent the other code sample vectors in the mini-batch, which serve as negative samples.
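The query-to-code objective described above can be sketched as a softmax cross-entropy over dot-product scores. This form is an assumption consistent with the description (related code as the target, in-batch codes and augmented hard negatives as distractors) and is not a transcription of Eq. (10); `metric_loss` and its argument names are hypothetical.

```python
import math

def dot(u, v):
    """Dot-product similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v))

def metric_loss(query_vecs, code_vecs, hard_code_vecs):
    """Mean loss over the batch; hard_code_vecs[i] lists the hard negatives of pair i."""
    batch = len(query_vecs)
    total = 0.0
    for i, q in enumerate(query_vecs):
        pos = dot(q, code_vecs[i])                                    # related code V_Ci
        negatives = [dot(q, code_vecs[j]) for j in range(batch) if j != i]  # in-batch V_Cj
        negatives += [dot(q, h) for h in hard_code_vecs[i]]           # augmented hard negatives
        scores = [pos] + negatives
        peak = max(scores)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - pos
    return total / batch

queries = [[1.0, 0.0], [0.0, 1.0]]
codes = [[1.0, 0.0], [0.0, 1.0]]
hard = [[[0.9, 0.0]], [[0.0, 0.9]]]
loss_plain = metric_loss(queries, codes, [[], []])
loss_with_hard = metric_loss(queries, codes, hard)
```

Passing empty hard-negative lists reduces the function to a plain in-batch loss, so the contribution of the augmented hard negatives can be inspected by comparing the two values.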

Contrastive-metric learning
The previous two sections proposed loss functions from the perspectives of metric learning and multimodal contrastive learning, respectively. To combine the advantages of contrastive learning and metric learning, we integrate metric learning and multimodal contrastive learning to train the model's feature representation ability and its ability to match relevant code-query pairs. For a mini-batch of size $B$, the total contrastive-metric learning loss is obtained from Eq. (11), where $L$ represents the total loss of a mini-batch, $L^{M}_{Q_i}$ represents the metric learning loss, $L^{C}_{Q_i}$ represents the query semantic modality contrastive learning loss, and $L^{C}_{t_i}$, $L^{C}_{a_i}$, $L^{C}_{c_i}$, and $L^{C}_{d_i}$ represent the code semantic, syntactic, control flow, and data flow modality contrastive learning losses, respectively. The pseudocode for the proposed contrastive-metric learning is given in Algorithm 1.
Algorithm 1. Contrastive-metric learning algorithm.
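As a compact summary of the terms listed for the total loss, an unweighted combination would read as follows. The equal weighting of the six terms and the averaging over the batch are assumptions made for illustration; the paper's Eq. (11) may sum rather than average, or weight the terms differently.

```latex
L = \frac{1}{B} \sum_{i=1}^{B}
    \left( L^{M}_{Q_i} + L^{C}_{Q_i} + L^{C}_{t_i} + L^{C}_{a_i} + L^{C}_{c_i} + L^{C}_{d_i} \right)
```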

Experimental evaluation
In order to investigate the structural rationality and the performance of CMCS on code search, extensive experiments are conducted to answer the following research questions:
RQ1: How effective is CMCS in the code search task?
RQ2: How is the training efficiency of CMCS?
RQ3: How do the different components of CMCS affect the model's performance?
RQ4: Is it possible for CMCS to overfit? How should this risk be reduced?
RQ5: How should the optimal parameters be chosen to ensure high performance of CMCS?

Dataset
In this work, we use the large-scale CodeSearchNet dataset [32] to evaluate the effectiveness of the proposed model. The dataset contains data for six programming languages, stored as semantically related code-query pairs. We filtered the dataset following the method of Guo et al. [33], removing low-quality data (such as code that cannot be parsed into abstract syntax trees, code that is too long or too short, or code that contains special characters). We also extracted a large number of code snippets from the entire corpus, grouped by programming language, to form the search codebase used for validation and testing. The details of the preprocessed dataset are shown in Table 1.

Baselines
This paper compares the proposed CMCS with seven advanced code search models. Three models (SyncoBERT, CodeRetriever, and CoCoSoDa) use contrastive learning techniques, two models (MRCS and TabCS) utilize the multimodal properties of code, and two models (CodeBERT and GraphCodeBERT) are pre-trained models fine-tuned on code search tasks. The baseline models are presented as follows.
MRCS [34] is an advanced multimodal representation model proposed by Gu et al. The model proposes four tree-sequence representations generated by traversal and sampling of the code. This paper uses Tokens + SBT, which has the best overall performance, as the multimodal input for code representation.
TabCS [35] is a two-stage attention-based code recommendation model proposed by Xu et al. [17]. Through a two-stage attention mechanism, code and query are represented based on the correlation between the features of the inputs.
CodeBERT [36] is a bimodal pre-trained model for programming language and natural language. It is pre-trained using two tasks: masked language modeling (MLM) and replaced token detection.
GraphCodeBERT [33] is a pre-trained model that incorporates code semantic structure information. It has three pre-training tasks: MLM, code structure edge prediction, and alignment of the representations of source code and code structure.
SyncoBERT [21] is a multi-modal contrastive pre-training model for code representation. It takes source code, the abstract syntax tree (AST), and the summarization as input and is pre-trained with identifier prediction and AST edge prediction.
CodeRetriever [19] is a pre-trained code model that learns function-level code semantic representations through large-scale code-text contrastive pre-training.
CoCoSoDa [20] is a code search model that effectively uses contrastive learning. It proposes soft data augmentation and a momentum mechanism to enhance the effect of contrastive learning, which are used to generate positive and negative samples at the text level, respectively.

Evaluation metrics
To evaluate the effectiveness of CMCS, we utilize the three most widely used metrics for code search: Normalized Discounted Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR), and SuccessRate@k (SR@k).
MRR [8] measures the ranking of the target code in the returned code list, and only considers the ranking of the most relevant code. The higher the MRR value, the higher the first hit code is ranked.
In the definition of MRR, $Q$ is the number of queries in the validation/test set, and $Rank_j$ is the ranking position of the most relevant code in the returned list for the $j$th query.
NDCG [34] measures the similarity between the code recommendation list returned by the model and the ideal code recommendation list, and it considers the overall ranking of the returned code list. A higher NDCG value indicates better overall ranked results.
In the definition of NDCG, $Q$ is the number of queries in the validation/test set, $r_j$ is the relevance to the query of the code at position $j$ in the returned top-$k$ search results, and $k$ denotes the cutoff of the returned list over which NDCG is computed for the query.
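The rank-based metrics of this subsection (MRR and NDCG above, SuccessRate@k below) can be sketched with their standard textbook definitions. The exponential gain `2**r - 1` and the `log2` discount in NDCG are a common choice assumed here, since the paper's own equation is not reproduced in the text; the function names are hypothetical.

```python
import math

def mrr(ranks):
    """Mean reciprocal rank; ranks[j] is the 1-based rank of the target code for query j."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def success_rate_at_k(ranks, k):
    """Fraction of queries whose target code appears within the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def ndcg_at_k(relevances, k):
    """NDCG for one query; relevances[p] is the relevance of the result at position p + 1."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(pos + 1)
                   for pos, r in enumerate(rels[:k], start=1))
    ideal = dcg(sorted(relevances, reverse=True))  # DCG of the ideal ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ranks = [1, 2, 4]  # hypothetical ranks of the correct snippet for three queries
mean_rr = mrr(ranks)
hit_at_1 = success_rate_at_k(ranks, 1)
```

`ndcg_at_k` scores a single query; averaging it over all queries in the validation or test set gives the reported NDCG.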
SuccessRate@k [35] measures the probability that the most relevant code is in the top-k of the returned code list. The higher the metric value, the higher the hit rate of the returned code list.
... and the matching relationship between code and queries is more apparent, which is more beneficial to improving the performance of code search. The above experimental results and vector distribution all demonstrate the positive role of CMCS in code search tasks.

RQ2: the training efficiency of CMCS
The results in Table 3 show that the convergence time of CMCS is significantly shorter than that of six of the baseline models but slightly longer than that of TabCS. Since TabCS uses only a simple neural network structure, it has far fewer parameters than CMCS and the other six baseline models and therefore the shortest training time. However, its code search performance is also the weakest. To sample hard negative samples, CMCS performs representation and clustering before each training epoch, which takes a certain amount of time. This cost is worthwhile: the hard negative samples derived from sampling and augmentation help CMCS quickly learn the features of the code and the query, as well as the correlation between them, greatly shortening its training time. From the above analysis, we conclude that CMCS can significantly improve training efficiency while ensuring code search performance.

In this section, we evaluate the rationality of the CMCS architecture, including the roles of the main components in CMCS and the optimal usage strategy for the four augmentation methods.

Ablation experiment
We conducted ablation experiments on CMCS to evaluate the effects of metric learning (ML), contrastive learning (CL), multimodal contrastive learning (MCL), and the hard negative samples (Hard) on code search performance, using MRR, NDCG, and SR@1. The results are shown in Table 4. Removing any part of CMCS degrades code search performance. Note that removing MCL means that contrastive learning is applied only to the semantic information modality of the code, whereas removing CL means that no contrastive learning module is used at all. The results indicate that the ML, MCL, and Hard components all play essential roles in the performance of CMCS and that the proposed architecture is reasonable. We can draw the following conclusion: multimodal contrastive learning enables the model to learn more code and query features, metric learning enables the model to learn the matching relationship between code and query, and hard negative sampling benefits training performance.

The utilization strategy of augmentation methods
CMCS has four augmentation methods: linear interpolation, stochastic perturbation, binary interpolation, and Gaussian scaling. We want to explore how to use them to achieve the best performance. We conducted experiments on CodeSearchNet, using MRR, to compare five augmentation strategies: four strategies that each use a single method (the same method for every augmentation) and one strategy that mixes all four methods (randomly selecting one of the four for every augmentation). The results are shown in Table 5.
The experimental results show that mixing all four methods performs better than using any single augmentation method. Since all vector-level augmentation methods essentially perturb vector features, we speculate that the mixed strategy yields more diverse perturbations, producing hard negative and positive sample vectors that are more effective for model training.
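To make the mixed strategy concrete, the sketch below implements plausible forms of the four vector-level methods and a selector that picks one of them at random for each augmentation. The formulas, parameter ranges, and function names are assumptions for illustration; this section does not give the exact definitions used by CMCS.

```python
import numpy as np


def linear_interpolation(h, other, lam):
    """Blend two vectors; lam close to 1 keeps the result near h."""
    return lam * h + (1.0 - lam) * other


def binary_interpolation(h, other, lam, rng):
    """Keep each dimension of h with probability lam, else copy it from other."""
    keep = rng.random(h.shape) < lam
    return np.where(keep, h, other)


def stochastic_perturbation(h, lam, rng):
    """Dropout-style noise (assumed form): zero each dimension with probability 1 - lam."""
    keep = rng.random(h.shape) < lam
    return h * keep


def gaussian_scaling(h, sigma, rng):
    """Rescale each dimension by a factor drawn from a Gaussian centred on 1."""
    return h * (1.0 + rng.normal(0.0, sigma, size=h.shape))


def mixed_augment(h, other, rng):
    """Pick one of the four methods and a perturbation amount at random."""
    choice = int(rng.integers(4))
    lam = rng.uniform(0.7, 1.0)     # hypothetical range for the mixing ratio
    sigma = rng.uniform(0.01, 0.1)  # hypothetical range for the noise scale
    if choice == 0:
        return linear_interpolation(h, other, lam)
    if choice == 1:
        return binary_interpolation(h, other, lam, rng)
    if choice == 2:
        return stochastic_perturbation(h, lam, rng)
    return gaussian_scaling(h, sigma, rng)


rng = np.random.default_rng(0)
h = rng.normal(size=8)      # stand-in for a code embedding
other = rng.normal(size=8)  # stand-in for another sample in the batch
augmented = mixed_augment(h, other, rng)
```

Drawing both the method and the perturbation amount anew on every call is what gives the mixed strategy a wider variety of perturbations than any single fixed method.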

RQ4: the effectiveness of measures to reduce the risk of overfitting in the CMCS
CMCS utilizes a K-means-based sampling method to obtain hard negative samples and employs a hardness-controllable augmentation method to generate positive and hard negative samples. However, the sampled data may lead to imbalanced categories within the training dataset. The augmentation technique may also cause the model to rely too heavily on sample features produced by specific augmentation methods, thereby posing a risk of overfitting. This section describes how CMCS mitigates the potential overfitting caused by sampling and augmentation. We also design experiments to verify the effectiveness of the proposed measures.
The K-means-based hard negative sampling method aims to select samples that are mutually hard negative examples in order to increase the proportion of hard negative samples in the mini-batch, thus enhancing the effectiveness of model training. CMCS adopts three measures to ensure the balance and diversity of samples within the constructed mini-batch, thereby reducing the risk of overfitting that may be caused by the introduction of hard negative samples. First, the sampled data originate from the real dataset, ensuring the authenticity of the sample features. Second, before each new training epoch begins, hard negative samples are re-clustered and re-sampled to construct new mini-batches. Specifically, the encoder updated in the previous epoch is used to re-represent the code samples, followed by re-clustering and re-sampling. Finally, only half of the samples in each constructed mini-batch come from hard negative sampling; the remaining samples are randomly sampled directly from the dataset.

The hardness-controllable augmentation method generates positive samples and hard negative samples for contrastive-metric learning. CMCS adopts a fine-grained random augmentation strategy to ensure the diversity of the generated data patterns. Specifically, each augmentation in CMCS randomly selects one of the four augmentation methods and a corresponding random amount of perturbation.
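The sampling measures described above can be sketched as follows: cluster the current embeddings with K-means, draw half of a mini-batch from a single cluster so that its members act as mutual hard negatives, and fill the other half with random samples. This is a minimal NumPy sketch under stated assumptions; the plain Lloyd's K-means, the function names, and the random embeddings standing in for encoder outputs are all illustrative and not the CMCS implementation.

```python
import numpy as np


def kmeans(x, k, rng, iters=20):
    """Plain Lloyd's K-means; returns one cluster label per row of x."""
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels


def build_minibatch(labels, batch_size, rng):
    """Half of the batch from one cluster (hard negatives), half random."""
    n_hard = batch_size // 2
    cluster = rng.choice(np.unique(labels))
    pool = np.flatnonzero(labels == cluster)
    hard = rng.choice(pool, size=min(n_hard, len(pool)), replace=False)
    # Draw the remaining samples at random from everything not yet chosen.
    rest = np.setdiff1d(np.arange(len(labels)), hard)
    rand = rng.choice(rest, size=batch_size - len(hard), replace=False)
    return np.concatenate([hard, rand])


rng = np.random.default_rng(0)
# Stand-in for code vectors produced by the encoder from the previous epoch.
embeddings = rng.normal(size=(200, 16))
cluster_labels = kmeans(embeddings, k=4, rng=rng)
batch_indices = build_minibatch(cluster_labels, batch_size=32, rng=rng)
```

In a training loop, the last three lines would be repeated at the start of every epoch with embeddings recomputed by the updated encoder, which corresponds to the re-representation, re-clustering, and re-sampling step.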
To verify the effectiveness of the above measures in reducing the risk of overfitting and improving the training performance of CMCS, we conducted experiments on CodeSearchNet, removing each of the four measures (A to D) from CMCS in turn. The experimental results are shown in Table 6. As can be seen from the second and third rows of Table 6, both randomly selecting one of multiple augmentation methods and randomly selecting a different amount of perturbation for each augmentation reduce overfitting and improve model training performance. This shows that the fine-grained random augmentation strategy can generate diverse samples, which is beneficial for balancing the training dataset. Moreover, augmentation techniques increase the quantity and diversity of the original dataset and thus inherently mitigate overfitting. The results in the fourth row of Table 6 indicate that adding a certain number of random samples to the mini-batch is beneficial for balancing the types of samples and training the model. The last row of Table 6 demonstrates the benefits of multiple rounds of representation, clustering, and sampling for model training. Multiple rounds of clustering and sampling enhance the randomness and richness of the hard negative samples in the mini-batch and maintain the balance of data in the training mini-batch. Moreover, the encoder updated in the previous epoch is used for vector representation before each clustering, which improves clustering accuracy. From the above analysis, it can be seen that the measures taken by CMCS effectively reduce the risk of overfitting and enhance the effectiveness of model training.

Table 5. The results for the augmentation strategy of CMCS. Significant values are given in bold.

RQ5: determining the optimal parameters for CMCS
The choice of the batch size
Batch size refers to the number of samples processed at once during each training iteration. The choice of an appropriate batch size plays a crucial role in the training efficiency and performance of the model. If the batch size is too small, the model may only see a fraction of the data in each iteration, leading to high variance in the learning process. If the batch size is too large, the model sees more data in each iteration, which might require more computational resources and potentially degrade the model's generalization ability. In this section, we explore the optimal batch size to maximize the performance of CMCS. Due to GPU memory limitations, the maximum batch size in this experiment is 128. We conduct experiments on the CodeSearchNet Java dataset, evaluating with five metrics: MRR, NDCG, and SR@1/5/10. The batch sizes tested are 4, 8, 16, 32, 64, and 128, with the results shown in Table 7. The results indicate that a batch size of 32 yields the best code search performance.

The choice of the number of clusters K
The proposed hard negative sampling method requires K-means clustering on pre-embedded vectors to determine the category of each code snippet in the training set. K-means is an unsupervised clustering algorithm based on a distance metric, and it requires a predetermined number of clusters K. To determine the optimal K, we set K to values ranging from 4 to 56 with a step size of 4. We conducted experiments on the CodeSearchNet Java dataset; the results are shown in Table 8.
It can be observed from the results that CMCS performs best when K is set to 44. We speculate that with more centroids, the code snippets can be divided into more refined categories, resulting in hard negative samples from the same category that are more similar to the original samples, which is beneficial for model training. We also found from Table 8 that the model performance improved rapidly as K increased from 4 to 32, but the improvement slowed down and reached a plateau as K increased from 32 to 44. More clusters mean that

Figure 1. A general code search process based on deep learning.

Figure 3. Overall architecture of CMCS. (a) Overall process of CMCS. (b) Details of the CL and ML parts. (c) Details of the augmentation part. (d) Details of the fusion part.
Although large deep learning models have outstanding performance, they require substantial computing and time resources, so they generally take a long time to train. CMCS generates hard negative samples through sampling and augmentation, which helps the model correct errors more quickly during training and enables the model parameters to converge more rapidly. We use the total training time after the model has reached basic convergence to reflect training efficiency: the shorter the total training time, the higher the training efficiency of the model. We train each of the seven baseline models and the CMCS model on the Java dataset and record their training time (in hours), and the results are shown in Table

Figure 4. Visualization of the distribution of code and query vectors.
A. The augmentation strategy is based on four methods and uses different amounts of perturbation. (After removing A, only one augmentation method is used, and the same amount of perturbation is applied for each augmentation.)
B. The augmentation strategy is based on four methods. (After removing B, only one augmentation method is used, and different amounts of perturbation are applied for each augmentation.)
C. The samples in the minibatch are composed of hard negative samples and randomly sampled negative samples. (After removing C, the minibatch is entirely composed of hard negative samples obtained through sampling.)
D. Before each epoch, re-embedding, clustering, and sampling are performed. (After removing D, embedding and hard negative sample clustering are performed only once, and random sampling is conducted in each epoch based on the initial clustering results.)

features of the input code and representing them separately. Based on multimodal representation vectors, CMCS uses metric contrastive learning to enable the model to perceive, learn, and represent more comprehensive and accurate code features, thereby improving the matching accuracy of relevant code and queries. The steps ③ and ④ in Fig.

Table 1. The details of the filtered CodeSearchNet dataset.

Table 3. Training efficiency of CMCS compared with baseline models.

Table 4. The ablation experiment results of CMCS. Significant values are given in bold.

Table 7. The performance of CMCS with different batch sizes. Significant values are given in bold.

Table 8. The performance of CMCS with different K values. Significant values are given in bold.